{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第二题：决策树处理回归任务"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "实验内容\n",
    "1.\t使用sklearn.tree.DecisionTreeRegressor完成kaggle房价预测问题\n",
    "2.\t计算最大深度为10的决策树，训练集上十折交叉验证的MAE和RMSE\n",
    "3.\t绘制最大深度从1到30，决策树在训练集和测试集上MAE的变化曲线\n",
    "4.  选择一个合理的树的最大深度，并给出理由"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. 读取数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "data = pd.read_csv('data/kaggle_house_price_prediction/kaggle_hourse_price_train.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 丢弃有缺失值的特征（列）\n",
    "data.dropna(axis = 1, inplace = True)\n",
    "\n",
    "# 只保留整数的特征\n",
    "data = data[[col for col in data.dtypes.index if data.dtypes[col] == 'int64']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Id</th>\n",
       "      <th>MSSubClass</th>\n",
       "      <th>LotArea</th>\n",
       "      <th>OverallQual</th>\n",
       "      <th>OverallCond</th>\n",
       "      <th>YearBuilt</th>\n",
       "      <th>YearRemodAdd</th>\n",
       "      <th>BsmtFinSF1</th>\n",
       "      <th>BsmtFinSF2</th>\n",
       "      <th>BsmtUnfSF</th>\n",
       "      <th>...</th>\n",
       "      <th>WoodDeckSF</th>\n",
       "      <th>OpenPorchSF</th>\n",
       "      <th>EnclosedPorch</th>\n",
       "      <th>3SsnPorch</th>\n",
       "      <th>ScreenPorch</th>\n",
       "      <th>PoolArea</th>\n",
       "      <th>MiscVal</th>\n",
       "      <th>MoSold</th>\n",
       "      <th>YrSold</th>\n",
       "      <th>SalePrice</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>60</td>\n",
       "      <td>8450</td>\n",
       "      <td>7</td>\n",
       "      <td>5</td>\n",
       "      <td>2003</td>\n",
       "      <td>2003</td>\n",
       "      <td>706</td>\n",
       "      <td>0</td>\n",
       "      <td>150</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>61</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2008</td>\n",
       "      <td>208500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>20</td>\n",
       "      <td>9600</td>\n",
       "      <td>6</td>\n",
       "      <td>8</td>\n",
       "      <td>1976</td>\n",
       "      <td>1976</td>\n",
       "      <td>978</td>\n",
       "      <td>0</td>\n",
       "      <td>284</td>\n",
       "      <td>...</td>\n",
       "      <td>298</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>2007</td>\n",
       "      <td>181500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>60</td>\n",
       "      <td>11250</td>\n",
       "      <td>7</td>\n",
       "      <td>5</td>\n",
       "      <td>2001</td>\n",
       "      <td>2002</td>\n",
       "      <td>486</td>\n",
       "      <td>0</td>\n",
       "      <td>434</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>42</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>2008</td>\n",
       "      <td>223500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>70</td>\n",
       "      <td>9550</td>\n",
       "      <td>7</td>\n",
       "      <td>5</td>\n",
       "      <td>1915</td>\n",
       "      <td>1970</td>\n",
       "      <td>216</td>\n",
       "      <td>0</td>\n",
       "      <td>540</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>35</td>\n",
       "      <td>272</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2006</td>\n",
       "      <td>140000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>60</td>\n",
       "      <td>14260</td>\n",
       "      <td>8</td>\n",
       "      <td>5</td>\n",
       "      <td>2000</td>\n",
       "      <td>2000</td>\n",
       "      <td>655</td>\n",
       "      <td>0</td>\n",
       "      <td>490</td>\n",
       "      <td>...</td>\n",
       "      <td>192</td>\n",
       "      <td>84</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>12</td>\n",
       "      <td>2008</td>\n",
       "      <td>250000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 35 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Id  MSSubClass  LotArea  OverallQual  OverallCond  YearBuilt  YearRemodAdd  \\\n",
       "0   1          60     8450            7            5       2003          2003   \n",
       "1   2          20     9600            6            8       1976          1976   \n",
       "2   3          60    11250            7            5       2001          2002   \n",
       "3   4          70     9550            7            5       1915          1970   \n",
       "4   5          60    14260            8            5       2000          2000   \n",
       "\n",
       "   BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  ...  WoodDeckSF  OpenPorchSF  \\\n",
       "0         706           0        150  ...           0           61   \n",
       "1         978           0        284  ...         298            0   \n",
       "2         486           0        434  ...           0           42   \n",
       "3         216           0        540  ...           0           35   \n",
       "4         655           0        490  ...         192           84   \n",
       "\n",
       "   EnclosedPorch  3SsnPorch  ScreenPorch  PoolArea  MiscVal  MoSold  YrSold  \\\n",
       "0              0          0            0         0        0       2    2008   \n",
       "1              0          0            0         0        0       5    2007   \n",
       "2              0          0            0         0        0       9    2008   \n",
       "3            272          0            0         0        0       2    2006   \n",
       "4              0          0            0         0        0      12    2008   \n",
       "\n",
       "   SalePrice  \n",
       "0     208500  \n",
       "1     181500  \n",
       "2     223500  \n",
       "3     140000  \n",
       "4     250000  \n",
       "\n",
       "[5 rows x 35 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. 数据集划分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "70%做训练集，30%做测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.utils import shuffle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_shuffled = shuffle(data, random_state = 32)\n",
    "split_line = int(len(data_shuffled) * 0.7)\n",
    "training_data = data_shuffled[:split_line]\n",
    "testing_data = data_shuffled[split_line:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. 导入模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_predict\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "from sklearn.metrics import mean_squared_error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. 选取特征和标记"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = data.columns.tolist()\n",
    "target = 'SalePrice'\n",
    "features.remove(target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. 训练与预测"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "请你在下面计算树的最大深度为10时，使用训练集全量特征训练的决策树的十折交叉验证的MAE和RMSE  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### 双击此处编辑\n",
    "最大深度为10，全量特征的决策树，十折交叉验证指标\n",
    "\n",
    "MAE|RMSE\n",
    "-|-\n",
    "0.0|0.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. 改变最大深度，绘制决策树的精度变换图"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "绘制最大深度从1到30，决策树训练集和测试集MAE的变化图，要把两个曲线画在一张图内，横坐标是最大深度，纵坐标是MAE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "plt.style.use(\"fivethirtyeight\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "请你选择一个合理的树的最大深度，并给出理由\n",
    "###### 双击此处编辑\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
